Abstract

When using mixture models it may be the case that the modeller has
a-priori beliefs or desires about what the components of the mixture
should represent. For example, if a mixture of normal densities is to be
fitted to some data, it may be desirable for components to focus on
capturing differences in location rather than scale. We introduce a
framework called proximity penalty priors (PPPs) that allows this
preference to be made explicit in the prior information. The approach is
scale-free and imposes minimal restrictions on the posterior; in
particular no arbitrary thresholds need to be set. We show the theoretical
validity of the approach, and demonstrate the effects of using PPPs on
posterior distributions with simulated and real data.

Keywords: Bayesian; Identifiability; MCMC; Mixture Model; Prior
Specification.

1 Introduction 

Mixture models are widely recognized as a useful tool for inference in a
variety of settings. Having been first used over 100 years ago (for
example, in Pearson, 1894), mixture models have more recently enjoyed a
revival, thanks to advances in computational methods for inference. In
particular, the EM algorithm (Dempster et al., 1977) and MCMC (see, for
example, Diebolt and Robert, 1994) have driven considerable advances in
the field. See McLachlan and Peel (2000) for a general overview of mixture
models; Frühwirth-Schnatter (2006) provides an overview of Bayesian
mixture models, which are the focus of this paper.

We recall the definition of a mixture model and introduce notation.
Suppose $n$ observations, $y_1, \ldots, y_n$, are taken from a
$K$-component mixture distribution where all the components have the same
distributional form, with mixture-specific parameters
$\theta = (\theta_1, \ldots, \theta_K)$, global parameters $\eta$ and
mixing weights $\pi = (\pi_1, \ldots, \pi_K)$, summarised by
$\gamma = (\pi, \theta, \eta)$. The mixture distribution for a single
observation $Y_i$ is then given by

$$ g(y_i \mid \gamma) = \sum_{k=1}^{K} \pi_k f_k(y_i \mid \theta_k, \eta), \qquad (1) $$

with $K > 1$, $\pi_k > 0$ ($k = 1, 2, \ldots, K$),
$\sum_{k=1}^{K} \pi_k = 1$, where $f_k(\cdot \mid \theta_k, \eta)$ is a
density function parametrised by $\theta_k$ and $\eta$.
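As a concrete illustration of Equation (1), the following minimal Python sketch evaluates a mixture density. The function names `normal_pdf` and `mixture_density`, and the restriction to normal components with no global parameter $\eta$, are assumptions made purely for illustration.

```python
import math

def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(y, weights, means, variances):
    """Evaluate g(y | gamma) = sum_k pi_k f_k(y | theta_k) for normal f_k."""
    # The mixing weights must be positive and sum to one, as in Equation (1).
    if any(w <= 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be positive and sum to one")
    return sum(w * normal_pdf(y, m, v)
               for w, m, v in zip(weights, means, variances))
```

For example, `mixture_density(1.0, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0])` evaluates an equally weighted two-component mixture at the midpoint between its component means.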

A Bayesian approach to estimating the parameters of the mixture
distribution of Equation (1) involves the specification of priors for the
parameters $\gamma$. Prior specification in this context presents a number
of difficulties.

First, fully improper priors cannot be used for component-specific
parameters in mixture models, since doing so causes the posterior to be
improper also (see, for example, McLachlan and Peel, 2000). However,
proper priors, even with large variance, can have considerable influence
on the posterior distribution, and the extent of this influence can be
difficult to assess (Marin et al., 2005). Re-parametrisation in a
hierarchical manner and allowing only the global parameters to be improper
is one solution: this is considered by Mengersen and Robert (1996), and
Roeder and Wasserman (1997). Another possibility is to use data-dependent
priors, as considered by Richardson and Green (1997), and Wasserman
(2000).



Second, where no component-specific information is available, identical
priors may be proposed for the components of each parameter. This leads to
a non-identifiable posterior, which is known as the label switching
problem. This has been well studied (see, for example, Stephens, 2000;
Jasra et al., 2005; Sperrin et al., 2010, and references therein).

Third, constructing independent priors for component parameters may not
be sensible, as the components only have meaning relative to one another
(Lee et al., 2008).



This third issue is the focus of this paper. We consider in detail the 
idea that priors should be specified relative to each other. We introduce a 
strategy for doing so that we call 'proximity penalty priors' (PPPs). The 
basic idea is that priors are specified in two parts: first, each prior is spec- 
ified independently, corresponding to standard existing approaches; second, 
a proximity penalty is applied, which penalises the joint prior distribution of 
certain configurations of parameters. We show that the construction makes 
theoretical sense. 

Section 2 introduces the idea of PPPs. Section 3 illustrates the
consequences of the PPP approach on real and simulated data; the paper
concludes with a discussion in Section 4.

2 Proximity Penalty Priors 

We begin with a simple result that establishes the validity of the PPP ap- 
proach. 

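Proposition 1. Suppose the prior for $\gamma$, given by $p(\gamma)$, can
be separated as

$$ p(\gamma) = p_1(\gamma)\, p_2(\gamma). $$

Denote the likelihood by $L(\gamma)$ and the posterior by $q(\gamma)$, so
that $q(\gamma) \propto L(\gamma)\, p(\gamma)$. Suppose that a new
parameter vector $\gamma^*$ can be simulated from a proposal distribution
$r(\gamma^*) \propto L(\gamma^*)\, p_1(\gamma^*)$, and the existing value
of $\gamma$ is $\gamma^m$. Then if we set

$$ \gamma^{m+1} = \begin{cases} \gamma^* & \text{with probability } \min\left(1, \dfrac{p_2(\gamma^*)}{p_2(\gamma^m)}\right), \\ \gamma^m & \text{otherwise,} \end{cases} \qquad (2) $$

the result is equivalent to a Metropolis-Hastings update.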

Proof. The acceptance probability for the Metropolis-Hastings procedure
with proposal density $r(\cdot)$ and posterior $q(\cdot)$ is

$$ \min\left(1, \frac{q(\gamma^*)\, r(\gamma^m)}{q(\gamma^m)\, r(\gamma^*)}\right). $$

Substituting in these densities gives the result. □
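Written out, the substitution uses the posterior, which is proportional to the likelihood times both prior parts, and the proposal, which is proportional to the likelihood times the first prior part only. The normalising constants cancel in the ratio, leaving:

```latex
\[
\frac{q(\gamma^*)\, r(\gamma^m)}{q(\gamma^m)\, r(\gamma^*)}
= \frac{L(\gamma^*)\, p_1(\gamma^*)\, p_2(\gamma^*) \; L(\gamma^m)\, p_1(\gamma^m)}
       {L(\gamma^m)\, p_1(\gamma^m)\, p_2(\gamma^m) \; L(\gamma^*)\, p_1(\gamma^*)}
= \frac{p_2(\gamma^*)}{p_2(\gamma^m)} .
\]
```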



In the context of this work the portion of the prior $p_1(\cdot)$
corresponds to the independent specification of the parameters, for which
standard distributions could be used; the portion $p_2(\cdot)$ corresponds
to the novel part of the prior that jointly assesses the values of the
parameters and penalises undesirable combinations.

Suppose that the priors $p_1(\cdot)$ are conjugate. Then an MCMC approach
would proceed, on each iteration, by generating proposed new parameters
according to a Gibbs sampling scheme with the full conditionals based on
the prior component $p_1(\cdot)$, then accepting the proposed parameters
according to a Metropolis-Hastings ratio on the prior component
$p_2(\cdot)$.
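The accept/reject step of this scheme can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the helper name `ppp_accept` is hypothetical, the proposal is assumed to have been drawn already from the Gibbs full conditional based on the first prior part, and the handling of a zero-weight current state is an assumption of the sketch.

```python
import random

def ppp_accept(p2, current, proposed, rng=random):
    """Return the next state: `proposed` with probability
    min(1, p2(proposed) / p2(current)), otherwise `current`.

    `proposed` is assumed to have been drawn from the full conditional
    implied by the likelihood and p1, so only p2 enters the ratio.
    """
    denom = p2(current)
    if denom <= 0.0:
        # Assumption of this sketch: a current state with zero p2 weight
        # (for example a poor starting value) is always left.
        return proposed
    ratio = p2(proposed) / denom
    # rng.random() lies in [0, 1), so a ratio >= 1 is always accepted
    # and a ratio of 0 is never accepted.
    return proposed if rng.random() < min(1.0, ratio) else current
```

Because the proposal already accounts for the likelihood and the first prior part, the only extra cost per iteration is two evaluations of the penalty.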

We illustrate the idea with an example. Consider a mixture of two normal
distributions

$$ p(y_i \mid \gamma) = \pi_1 N(y_i; \mu_1, \sigma_1^2) + \pi_2 N(y_i; \mu_2, \sigma_2^2), \qquad (3) $$

with $\pi_1 + \pi_2 = 1$, and all the parameters
$\gamma = (\pi_1, \pi_2, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ unknown.
Standard conjugate prior choices would then be a Dirichlet distribution
for the pair $(\pi_1, \pi_2)$, normal distributions for $\mu_1$ and
$\mu_2$, and inverse-gamma distributions for $\sigma_1^2$ and
$\sigma_2^2$. Throughout this paper we will use the empirical Bayes prior
distributions suggested by Richardson and Green (1997) unless otherwise
stated. We may believe a-priori that the key difference between the two
components is the location. If the components are not well separated or
the amount of data is small it is important that such prior information is
captured. By Proposition 1, we can reflect these beliefs in a separate
part of the prior $p_2(\cdot)$. A sensible such choice is

$$ p_2(\gamma) = |\mu_1 - \mu_2|. \qquad (4) $$

Such a function assigns more prior weight to larger differences between
$\mu_1$ and $\mu_2$. In isolation, the above $p_2(\cdot)$ is improper, but
provided $p_1(\cdot)$ is proper with finite first moments for $\mu_1$ and
$\mu_2$ (as is the case for the normal priors above) the overall prior is
proper. Such a prior enjoys scale invariance in the sense that
$p_2(a x_1) / p_2(a x_2) = p_2(x_1) / p_2(x_2)$ for all non-zero $a$. This
may or may not be desirable. An alternative would be to specify a distance
$\delta$ as a minimum distance between $\mu_1$ and $\mu_2$, i.e.

$$ p_2(\gamma) = 1_{(|\mu_1 - \mu_2| > \delta)}. $$

This generates the question of how $\delta$ should be specified, but may
be appropriate in some situations.
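The two penalty choices for the two-component case, and the scale-invariance property of the first, can be checked numerically. The sketch below is illustrative only; the function names `p2_abs` and `p2_threshold` and the particular configurations are hypothetical.

```python
def p2_abs(mu1, mu2):
    """Equation (4): prior weight proportional to |mu1 - mu2|."""
    return abs(mu1 - mu2)

def p2_threshold(mu1, mu2, delta):
    """Indicator alternative: 1 if |mu1 - mu2| > delta, else 0."""
    return 1.0 if abs(mu1 - mu2) > delta else 0.0

# Scale invariance of Equation (4): rescaling both configurations by the
# same non-zero constant leaves the ratio of penalties unchanged.
x1, x2 = (0.0, 2.0), (1.0, 1.5)   # two (mu1, mu2) configurations
a = 10.0                          # e.g. a change of measurement units
ratio_original = p2_abs(*x1) / p2_abs(*x2)
ratio_rescaled = p2_abs(a * x1[0], a * x1[1]) / p2_abs(a * x2[0], a * x2[1])

# The threshold version is not scale-free: with delta fixed, the same
# configuration can fail or pass depending on the units used.
fails_in_original_units = p2_threshold(0.0, 0.5, delta=1.0)
passes_after_rescaling = p2_threshold(a * 0.0, a * 0.5, delta=1.0)
```

The last two lines show why the threshold form requires the distance threshold to be chosen with the measurement scale in mind, whereas Equation (4) does not.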



More generally, for a mixture distribution with $K$ components, suppose
there exists a component-specific parameter $\phi_k$ for each component
$k = 1, \ldots, K$, and the difference between the components is a-priori
believed (or, from the point of view of model interpretation, desired) to
be in terms of this parameter. Then we propose setting

$$ p_2(\gamma) = \min_{k \neq l} |\phi_k - \phi_l|. \qquad (5) $$

On the other hand, for a mixture distribution with $K$ components, if
there exists a component-specific parameter $\psi_k$ for each component
$k = 1, \ldots, K$, and each component is a-priori expected or desired to
have similar values of this parameter, we could set

$$ p_2(\gamma) = \max_{k \neq l} |\psi_k - \psi_l|^{-1}. \qquad (6) $$

Here, the scale-free nature of $p_2(\cdot)$ is an advantage in that we do
not have to quantify 'similar'. More generally, $p_2(\gamma)$ could be
constructed as any multiplicative combination of Equations (5) and (6).
The procedure can also be applied when the number of components $K$ is
allowed to vary, in which case it makes sense only within fixed values of
$K$, in the same way that the label switching problem only has meaning
within fixed values of $K$ (Nobile and Fearnside, 2007).
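The general penalties can be sketched as functions of a list of component-specific parameter values. The names `p2_separate`, `p2_similar` and `p2_combined` are hypothetical, and Equation (6) is implemented here exactly as it is written, as the largest pairwise inverse difference; that reading is an assumption of this sketch (for two components there is only one pair, so no ambiguity arises).

```python
from itertools import combinations

def p2_separate(phi):
    """Equation (5): min over pairs k != l of |phi_k - phi_l|."""
    return min(abs(a - b) for a, b in combinations(phi, 2))

def p2_similar(psi):
    """Equation (6) as written: max over pairs k != l of 1 / |psi_k - psi_l|.

    Raises ZeroDivisionError if two values coincide exactly.
    """
    return max(1.0 / abs(a - b) for a, b in combinations(psi, 2))

def p2_combined(phi, psi):
    """One multiplicative combination of Equations (5) and (6)."""
    return p2_separate(phi) * p2_similar(psi)
```

For example, `p2_separate([0.0, 2.0, 5.0])` returns the smallest gap between any two of the three values, so a configuration is only rewarded when every pair of components is separated.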



3 Examples 

3.1 Mixture of Two Normals 

Our first illustration takes the simple mixture of two normals example.
We generate 100 observations from the density given in Equation (3), with
$\mu_1 = 0$, $\mu_2 = 2$, $\sigma_1^2 = \sigma_2^2 = 1$ and
$\pi_1 = \pi_2 = 0.5$. We consider two prior specifications:



(a) the standard specification given in Richardson and Green (1997),
denoted 'without PPP';

(b) a two-part prior $p(\gamma) = p_1(\gamma)\, p_2(\gamma)$, with
$p_1(\gamma)$ as given in Richardson and Green (1997) and $p_2(\gamma)$ as
given in Equation (4), denoted 'with PPP'.



In both cases we fix the number of components $K = 2$. In (b), we are
therefore adding an explicit prior opinion that the difference between the
two components is in the locations $\mu_1$ and $\mu_2$.
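A sampler for the two specifications can be sketched as follows, assuming NumPy is available. This is a minimal sketch under stated assumptions rather than the implementation used for the results reported here: the function name `gibbs_ppp` is hypothetical, and the fixed, data-dependent hyperparameters are illustrative simplifications and are not claimed to reproduce the Richardson and Green (1997) specification. Only the mean update involves the penalty, because the penalty depends on the means alone.

```python
import numpy as np

def gibbs_ppp(y, n_iter=2000, use_ppp=True, seed=0):
    """Gibbs sampler for a two-component normal mixture; the mean update is
    accepted with probability min(1, p2(new) / p2(old)), p2 = |mu1 - mu2|."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n, K = y.size, 2
    # Hypothetical hyperparameters (illustrative, data-dependent choices).
    R = y.max() - y.min()
    xi, kappa = 0.5 * (y.max() + y.min()), 1.0 / R**2  # mu_k ~ N(xi, 1/kappa)
    alpha, beta = 2.0, 0.02 * R**2   # 1/sigma_k^2 ~ Gamma(alpha, rate=beta)
    dirichlet = np.ones(K)           # pi ~ Dirichlet(1, 1)

    mu = np.array([y.min(), y.max()])  # distinct starting means, so p2 > 0
    var = np.full(K, y.var())
    pi = np.full(K, 1.0 / K)
    draws = np.empty((n_iter, 3 * K))  # columns: pi, mu, var

    for it in range(n_iter):
        # 1. Allocations z_i given (pi, mu, var).
        logp = (np.log(pi) - 0.5 * np.log(var)
                - 0.5 * (y[:, None] - mu) ** 2 / var)
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = (rng.random(n) > p[:, 0]).astype(int)
        counts = np.bincount(z, minlength=K)
        sums = np.bincount(z, weights=y, minlength=K)

        # 2. Weights: conjugate Dirichlet update (p2 does not involve pi).
        pi = rng.dirichlet(dirichlet + counts)

        # 3. Means: propose from the conjugate full conditional under p1 ...
        prec = counts / var + kappa
        mu_new = rng.normal((sums / var + kappa * xi) / prec,
                            1.0 / np.sqrt(prec))
        # ... then accept using the p2 ratio only (Proposition 1).
        if not use_ppp or (rng.random()
                           < abs(mu_new[0] - mu_new[1]) / abs(mu[0] - mu[1])):
            mu = mu_new

        # 4. Variances: conjugate inverse-gamma update (p2 does not involve var).
        ss = np.bincount(z, weights=(y - mu[z]) ** 2, minlength=K)
        var = 1.0 / rng.gamma(alpha + 0.5 * counts, 1.0 / (beta + 0.5 * ss))

        draws[it] = np.concatenate([pi, mu, var])
    return draws
```

Calling `gibbs_ppp(y, use_ppp=False)` and `gibbs_ppp(y, use_ppp=True)` on the same data gives two sets of draws whose mean and variance columns can be compared; no relabelling is performed, so the output is subject to label switching.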

Figure 1 compares a bivariate projection of the posterior onto the
absolute difference $|\mu_1 - \mu_2|$ and $\max(\sigma_1^2, \sigma_2^2)$
without and with the PPP. Without the PPP, posterior mass is assigned to
the situation where $|\mu_1 - \mu_2|$ is small and
$\max(\sigma_1^2, \sigma_2^2)$ is large. This corresponds to a case where
a mixture distribution with similar means but different variances is
fitted. In Figure 2 we see that such a mixture is well supported by the
data (dashed line in the figure). Once the PPP is applied, far less
posterior mass is assigned to this scenario, since our prior distribution
specifically tells us to exclude such cases.
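The extent to which the data support both interpretations can be explored by comparing log-likelihoods under the generating density and under the alternative density with equal means, 0.5N(1, 1) + 0.5N(1, 4), shown in Figure 2. The sketch below assumes NumPy; the seed and the helper names are arbitrary, so the simulated sample is not the one analysed in this section.

```python
import numpy as np

def normal_pdf(y, mean, var):
    """Density of N(mean, var), evaluated elementwise on y."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mixture_loglik(y, weights, means, variances):
    """Log-likelihood of the sample y under a normal mixture."""
    dens = sum(w * normal_pdf(y, m, v)
               for w, m, v in zip(weights, means, variances))
    return float(np.sum(np.log(dens)))

rng = np.random.default_rng(2024)   # arbitrary seed (an assumption)
labels = rng.random(100) < 0.5      # component membership, pi_1 = pi_2 = 0.5
y = np.where(labels, rng.normal(0.0, 1.0, 100), rng.normal(2.0, 1.0, 100))

# Components differing in location (the generating density) ...
loglik_location = mixture_loglik(y, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
# ... versus components differing in scale only (the alternative density).
loglik_scale = mixture_loglik(y, [0.5, 0.5], [1.0, 1.0], [1.0, 4.0])
print(loglik_location, loglik_scale)
```

When the two values are close, the likelihood alone does little to distinguish the two interpretations, which is the situation in which the prior matters most.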

Figure 3 gives the marginal bivariate posterior of $(\mu_1, \mu_2)$, with
and without the PPP. Without the PPP, the posterior appears to have a
single mode at approximately $\mu_1 = \mu_2 = 1$; with the PPP, the
posterior is bimodal with modes at approximately $(\mu_1 = 0, \mu_2 = 2)$
and $(\mu_1 = 2, \mu_2 = 0)$. The bimodality in the PPP case is a
consequence of label switching; if component-specific inference is
required, post-hoc relabelling should be carried out (see, for example,
Sperrin et al., 2010). The unimodality in the non-PPP case is caused by
the two means being very close together and the variances differing,
corresponding to a different interpretation of the mixture components.

We also ran the same comparison without assuming a fixed number of
components $K$ (using the birth-death method of Stephens, 2000), putting a
Poisson(1) prior distribution on the number of components $K$ (see Nobile
and Fearnside, 2007, for a justification of the use of this prior).
Similar results to the above were observed when we looked at the output
conditional on $K = 2$.



3.2 Galaxy Data 

The galaxy dataset is commonly used to illustrate mixture modelling
techniques (see Jasra et al., 2005, for a recent investigation of this
dataset in the mixture modelling context). Briefly, it consists of the
velocities of 82 galaxies, but the velocities appear to cluster,
suggesting different groups of galaxies that we may wish to identify (see
Figure 4). If we model these data using a mixture, it is likely that we
wish our mixture components to represent the clusters with different mean
velocities, hence the PPP of Equation (5) could be considered in this
scenario. We run a variable dimension sampler with the details as above,
with normally distributed components assumed and a Poisson(1) prior
distribution on the number of components $K$. We compare the results of
standard priors (i.e. those given in Richardson and Green, 1997) with the
standard priors plus the PPP. Both with and without the PPP, the values of
$K$ with the majority of posterior support are $K = 3$ and $K = 4$ (but
see Aitkin, 2001, for discussion on the posterior of the number of
components in a mixture model). For the $K = 3$ case the posterior means
are already well separated, and the PPP has little or no effect on the
posterior means. We look in more detail at the $K = 4$ case.

Figure 1: Posterior contour plots of $|\mu_1 - \mu_2|$ versus
$\max(\sigma_1^2, \sigma_2^2)$: (a) without PPP; (b) with PPP.

Figure 2: Histogram of 100 realisations from 0.5N(0, 1) + 0.5N(2, 1) with
true density overlaid (solid line) and alternative density,
0.5N(1, 1) + 0.5N(1, 4), also overlaid (dashed line).

In order to avoid the label switching issue, we first consider the
posterior of a generic $\mu_k$ without relabelling, estimating this by
combining into a single vector all samples from the posterior of $\mu_k$,
for $k = 1, 2, 3, 4$, conditional on $K = 4$. We can do this since
invariance of the posterior under re-parametrisation means we can ignore
the labels. The resulting density plot is given in Figure 5. The
interesting difference to note here is that with the PPP four distinct
peaks can be observed in the density, whereas without the PPP the middle
two peaks cannot be distinguished. This does, however, depend on the
smoothing parameter used in the non-parametric density estimate.
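The pooling step and the dependence on the smoothing parameter can be sketched as follows, assuming NumPy and an array of posterior draws of the component means conditional on a fixed number of components. The function names and the plain Gaussian kernel estimator are assumptions for illustration; the density estimator actually used for Figure 5 is not specified here.

```python
import numpy as np

def pooled_means(mu_draws):
    """Pool an (n_draws, K) array of posterior draws of mu_1, ..., mu_K into
    a single vector, discarding the component labels."""
    return np.asarray(mu_draws, dtype=float).ravel()

def gaussian_kde_1d(samples, grid, bandwidth):
    """Plain Gaussian kernel density estimate evaluated on `grid`;
    `bandwidth` plays the role of the smoothing parameter."""
    samples = np.asarray(samples, dtype=float)
    u = (np.asarray(grid, dtype=float)[:, None] - samples[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * u**2).sum(axis=1)
    return kernel_sum / (samples.size * bandwidth * np.sqrt(2.0 * np.pi))
```

Evaluating `gaussian_kde_1d` on the pooled vector for several bandwidths shows directly how the number of visible peaks depends on the amount of smoothing.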

Figure 3: Posterior contour plots of $\mu_1$ versus $\mu_2$: (a) without
PPP; (b) with PPP.

Figure 4: Histogram of the velocities of 82 galaxies.

Figure 5: Smoothed density of a generic $\mu_k$ for the galaxy data.
Without PPP: dashed line; with PPP: solid line.

To consider this further we mitigate the label switching issue by applying
the identifiability constraint $\mu_1 < \mu_2 < \mu_3 < \mu_4$, then look
at the posterior density of $(\mu_3 - \mu_2)$. This is given in Figure 6.
We see that applying the PPP causes more separation between the two
component means (less mass at small differences).
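Applying the ordering constraint to posterior draws of the means amounts to sorting each draw. A minimal sketch, assuming NumPy and an array of draws with one row per iteration and one column per component (the function name `ordered_gap` is hypothetical):

```python
import numpy as np

def ordered_gap(mu_draws, lower=1, upper=2):
    """Impose mu_1 < ... < mu_K on each draw by sorting the means, then
    return the gap between the ordered means at 0-based positions `upper`
    and `lower`. The defaults give mu_3 - mu_2."""
    ordered = np.sort(np.asarray(mu_draws, dtype=float), axis=1)
    return ordered[:, upper] - ordered[:, lower]
```

This sketch sorts the means only; if weights or variances were also of interest, the same permutation would have to be applied to them within each draw.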



4 Discussion 

In this paper we have introduced the idea of incorporating weak joint infor- 
mation about parameters in a mixture model into the prior specification. In 
particular we have introduced proximity penalty priors (PPPs) as a method 
of explicitly declaring an a-priori opinion (or interest) in components that 
differ on a certain parameter. The formulation is designed to allow this opin- 
ion to be as vague as possible: we avoid making any statement about the 
magnitude of the difference that should be observed between the components, 
i.e. the method is scale-free. 

Figure 6: Smoothed density of $(\mu_3 - \mu_2)$ for the galaxy data with
$K = 4$ after an identifiability constraint (IC) is applied. Without PPP:
dashed line; with PPP: solid line.

With the focus of this paper being introduction of the idea, the examples
were kept fairly simple. The idea, however, is very general and could be
applied in more complex models. For example, in an application such as
genetics we may wish to construct a mixture of regressions with many
covariates. Suppose there are $p$ covariates and $K$ mixture components,
with the coefficient of the $j$-th covariate in the $k$-th component given
by $\beta_{jk}$. Then we could consider the PPP

$$ p_2(\gamma) = \max_{j} \min_{k \neq l} |\beta_{jk} - \beta_{jl}|, $$

to reflect a belief that each component should have at least one
coefficient that differs from the value in every other component.
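The regression penalty can be evaluated directly from a matrix of coefficients. The sketch below assumes NumPy and a coefficient array with one row per covariate and one column per component; the function name `p2_regression` is hypothetical, and the formula is implemented as written, with the maximum over covariates taken outside the minimum over pairs of components.

```python
import numpy as np
from itertools import combinations

def p2_regression(beta):
    """beta has shape (p, K): beta[j, k] is the coefficient of covariate j
    in component k. Returns max_j min_{k != l} |beta[j, k] - beta[j, l]|."""
    beta = np.asarray(beta, dtype=float)
    K = beta.shape[1]
    # For each covariate, the smallest gap between any two components.
    per_covariate = [min(abs(row[k] - row[l])
                         for k, l in combinations(range(K), 2))
                     for row in beta]
    # The penalty is driven by the best-separated covariate.
    return max(per_covariate)
```

For instance, with coefficients `[[1, 1, 1], [0, 2, 5]]` the first covariate does not separate the components at all, so the penalty is determined entirely by the second.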

Another potential extension is to replace the $L_1$-norm assumed in the
PPP with an $L_s$-norm, i.e. considering a generalisation of, for example,
Equation (4), to

$$ p_2(\gamma) = |\mu_1 - \mu_2|^s. $$

In this generalised setting, we note that $s = 0$ clearly corresponds to
an unpenalised prior and $s = 1$ reduces to the original Equation (4).
Also, setting $s = -1$ encodes a PPP like Equation (6). This
generalisation then raises the question of how $s$ should be chosen. We
suggest $s = 1$ is a very natural choice, since this means the penalty is
being applied on the original scale of the data. We have, however, looked
at the sensitivity to the choice of $s$. For the example considered in
Section 3, once $s$ becomes large the posteriors for $\mu$ become very
flat.
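The special cases of the generalised penalty are easy to verify numerically. A minimal sketch (the function name `p2_power` is hypothetical):

```python
def p2_power(mu1, mu2, s=1.0):
    """Generalised penalty |mu1 - mu2| ** s.

    s = 0 gives a constant (no penalty), s = 1 recovers Equation (4),
    and s = -1 gives the inverse-difference form of Equation (6) for two
    components. Negative s raises ZeroDivisionError when mu1 == mu2.
    """
    return abs(mu1 - mu2) ** s

# Relative prior weight of a gap of 2 versus a gap of 1, for several s.
weights_by_s = {s: p2_power(0.0, 2.0, s) / p2_power(0.0, 1.0, s)
                for s in (-1, 0, 1, 2, 4)}
```

The relative weight given to the wider gap is 2 raised to the power $s$, which illustrates how quickly large values of $s$ come to dominate the independent part of the prior.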